
[main] Add new method to track originating cluster of Backup #613

Open · wants to merge 11 commits into base: main from bro-metadata-context

Conversation

@mallardduck (Member) commented Oct 31, 2024

This partially solves #612.

However, it's important to note that while I wrote an RFE for this too, the root goal here is to fix a bug. The bug is that "BRO itself doesn't do anything to give indications when In-Place Restore is acceptable to use".

The short version of the fix: we add a few new status details and expose them via kubectl. Later, we'll make a change to rancher/dashboard so the UI can also benefit from this new context by adding a column.


To be discussed:

  • This change is very backwards compatible; can we ship a CRD change for the new status field in a 2.10.x release?
    • Feels like a grey area - but probably OK.
    • k8s seems to be OK with full-on schema additions (to spec too) as long as they're optional/nullable. This is hopefully less impactful since users cannot control status.
    • The current result on a cluster without the CRD change is:
      • a new warning that originCluster does not exist on status,
      • status conditions are correctly set to represent unknown in-place viability.
  • Do we want to explore supporting an even more backwards-friendly method of detecting origin IDs via the filename?
    • For the best UX, we will account for existing backups with the UID in the filename. In other words, once the CRD is updated with the new status field, we'll add the UID we find in the filename (or do nothing if it doesn't match the expected filename format).
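A rough sketch of what the filename fallback could look like (the helper name and the filename shown are illustrative; the actual backup naming scheme may differ):

```go
package main

import (
	"fmt"
	"regexp"
)

// uidPattern matches a Kubernetes UID (an RFC 4122 UUID) embedded
// anywhere in a filename.
var uidPattern = regexp.MustCompile(`[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`)

// originUIDFromFilename extracts a cluster UID from a backup filename such as
// "mybackup-8afff587-b8ef-4ac9-8f29-2f6a1f2eacbb-2024-10-31.tar.gz".
// It returns "" when the filename does not contain a UID, in which case
// the caller would leave the origin status unset.
func originUIDFromFilename(filename string) string {
	return uidPattern.FindString(filename)
}

func main() {
	fmt.Println(originUIDFromFilename("mybackup-8afff587-b8ef-4ac9-8f29-2f6a1f2eacbb-2024-10-31.tar.gz"))
	fmt.Println(originUIDFromFilename("no-uid-here.tar.gz") == "")
}
```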

@mallardduck mallardduck requested a review from a team as a code owner October 31, 2024 23:27
@mallardduck mallardduck force-pushed the bro-metadata-context branch 2 times, most recently from 4967e19 to f0b6b30 Compare November 1, 2024 01:26
@mallardduck
Copy link
Member Author

mallardduck commented Nov 1, 2024

@jbiers - I'm not moving this to "in review" yet, since this was a very rough first pass.

However, if you can take some time to get familiar with the changes and review the list of concerns I thought of (and consider/add your own), that would be a great help to keep the ball rolling on this. In other words, while I could keep working on this, I'd love your feedback before I spend more time on it.

@mallardduck (Member, Author) commented:

@alexandreLamarre - I've added you as a reviewer here as I'd like to get your take on this. When you have time, LMK if you'd like to sync up and chat in a huddle. Or feel free to review in your own time and give me feedback here.

@alexandreLamarre (Contributor) left a comment:

Mostly some small comments about making the code simpler, more readable, and testable.

Since this PR could be a stepping stone to fully allowing backups of backup CRDs themselves, storing the cluster origin metadata in the Status feels wrong.

But if we do intend to fully support taking backups of backup CRDs, then even storing the cluster origin metadata in the Spec section of the CRD feels wrong.

What happens when we have a list of remote backups say A, B, C, all taken from different origins?

If the current cluster matches origin C then our status will indicate the restore is viable even if they specify an older backup A or B. I know this hypothetical would be rare but I think it points to the fact that maybe we should think about this differently.

The current design labels recurring backups as coming from this cluster only, whereas one-off backups are tied to the specific clusters they were taken from.

What if we embedded cluster origin metadata not in the CRDs, but the backup data itself?

This way the flow looks like

  • Each remote backup contains a cluster origin
  • Each restore pointing to a specific backup can determine before actually restoring objects if it is viable
  • We can add a flag to the restore spec to allow unsafe backups, ones where the cluster origin doesn't match
  • By default, we disallow cases where we could have unsafe restores
  • If the restore is unsafe, the status is surfaced through the restore CR
  • Users can then modify the restore CR to restore the backup even if it is unsafe after the fact, or opt in before creating the restore CR

From the end user's perspective it is somewhat clunkier to create a restore and then be informed that it is potentially unsafe, but as an end user I value being shown correct information more.

Additionally, storing the cluster origin metadata in the backup data allows us to add tests for each corner case in the e2e restore tests, producing a potentially less volatile new feature.
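The proposed flow could be sketched roughly like this (all type and field names here are illustrative, not the actual Backup/Restore CRD schema):

```go
package main

import "fmt"

// BackupMetadata is an illustrative stand-in for origin metadata embedded
// in the backup archive itself, rather than in a CRD.
type BackupMetadata struct {
	ClusterOriginUID string
}

// RestoreSpec is an illustrative restore spec with an explicit opt-in
// flag for restoring backups whose origin doesn't match this cluster.
type RestoreSpec struct {
	BackupFilename string
	AllowUnsafe    bool
}

// restoreViable decides, before any objects are restored, whether the
// backup's embedded origin matches the current cluster, honoring the
// unsafe opt-in. The reason string would be surfaced via the restore CR's
// status so users can opt in after the fact.
func restoreViable(md BackupMetadata, spec RestoreSpec, currentClusterUID string) (bool, string) {
	if md.ClusterOriginUID == currentClusterUID {
		return true, "origin matches current cluster"
	}
	if spec.AllowUnsafe {
		return true, "origin mismatch, but unsafe restore was explicitly allowed"
	}
	return false, "origin mismatch; set allowUnsafe to restore anyway"
}

func main() {
	ok, reason := restoreViable(
		BackupMetadata{ClusterOriginUID: "uid-a"},
		RestoreSpec{AllowUnsafe: false},
		"uid-b",
	)
	fmt.Println(ok, reason)
}
```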

Comment on lines +9 to +16
backupName string
hasClusterOriginID bool
clusterOriginID string
hasCurrentOriginCondition bool
currentOriginCondition bool
canInPlaceRestore bool
hasInPlaceRestoreCondition bool
currentInPlaceRestoreCondition bool

@alexandreLamarre (Contributor) commented Jan 6, 2025:
Instead of having boolean fields injected during construction, we could encapsulate:

type backupClusterMdChecker struct {
	backupName    string
	clusterOrigin string
	b             *backupv1.Backup
	// Any controllers or client runtime needed for methods, if any
}

func (b backupClusterMdChecker) hasClusterOriginID() bool {
	// ...
}

func (b backupClusterMdChecker) hasCurrentOriginCondition() bool {
	// ...
}

This allows us to isolate some of the more complicated logic, and also allows for testing if we make these methods or the struct public, which we probably should, because we should be focused on stability.

If making them public, we could consider moving them to a sub-package, pkg/controllers/backup/origin or something, to prevent imports of heavy k8s dependencies in unit tests.

Also greatly simplifies the if statements below IMO.

dynamicClient: dynamicInterface,
defaultBackupMountPath: defaultLocalBackupLocation,
defaultS3BackupLocation: defaultS3,
canUseClusterOriginStatus: util.VerifyBackupCrdHasClusterStatus(clientSet.ApiextensionsV1()),

@alexandreLamarre (Contributor) commented:

I would avoid passing the handler a value computed by a function that makes API calls. Instead, I would include that logic in the setup itself, before the call to Register.

See comment below as well.

Comment on lines +118 to +121
clusterOriginChanged := h.prepareClusterOriginConditions(backup)
if clusterOriginChanged {
shouldUpdateStatus = true
}

@alexandreLamarre (Contributor) commented Jan 6, 2025:

nit: can simplify using

if originChanged := h.prepareClusterOriginConditions(backup); originChanged {
	shouldUpdateStatus = true
}

Comment on lines +110 to +113
if err != nil {
logrus.Infof("Error fetching CRD: %v", err)
return false
}

@alexandreLamarre (Contributor) commented:

In a more production-ready version of this, I'd expect errors like this to be surfaced in the setup step, and potentially to exit, since an error like that signifies either a more serious problem or a transient one (in the latter case, we could consider a backoff mechanism).

@@ -37,6 +39,7 @@ type BackupSpec struct {

type BackupStatus struct {
Conditions []genericcondition.GenericCondition `json:"conditions"`
OriginCluster string `json:"originCluster,omitempty"`

@alexandreLamarre (Contributor) commented:

I'm not convinced the origin cluster metadata should be stored in the Status field.

I can see from the original CRD definition that the semantic difference between spec (the desired state of the object) and the status (what is the current state of the object) is a little bit muddied in this operator.

IMO, the origin cluster (and additional metadata) could be set in the Spec and be fully managed by the operator

In any case, the backup CRD's status should be the one surfacing whether or not a backup is viable for sure 👍

Comment on lines +109 to +120
crd, err := getCRDDefinition(client, crdName)
if err != nil {
logrus.Infof("Error fetching CRD: %v", err)
return false
}

// Inspect the status schema, for example
_, found := crd.Spec.Versions[0].Schema.OpenAPIV3Schema.Properties["status"].Properties["originCluster"]
if found {
logrus.Debugf("Status schema contains `originCluster` on CRD `%s`.\n", crdName)
return true
}

@alexandreLamarre (Contributor) commented:

Going back to the above discussion, storing the cluster metadata in the backup CRD spec as a nullable object shouldn't be a breaking change, and we can instead check whether it is nil rather than doing dynamic CRD reflection.

IMO it is easier to make this testable if we avoid inspecting CRD definitions and instead rely on the presence/absence of a new field.
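For illustration, a nil check on a nullable spec field could look like this (field and type names are hypothetical, not the actual CRD schema):

```go
package main

import "fmt"

// ClusterOrigin is an illustrative nullable spec object; a flat string
// could not distinguish "unset" from "empty" as cleanly.
type ClusterOrigin struct {
	UID string `json:"uid"`
}

// BackupSpec carries the origin as a pointer so presence can be tested
// with a nil check instead of dynamic CRD reflection.
type BackupSpec struct {
	ClusterOrigin *ClusterOrigin `json:"clusterOrigin,omitempty"`
}

// originKnown reports whether the origin was ever recorded: older
// operator versions simply never set the field, so nil means "unknown".
func originKnown(spec BackupSpec) bool {
	return spec.ClusterOrigin != nil
}

func main() {
	fmt.Println(originKnown(BackupSpec{}))
	fmt.Println(originKnown(BackupSpec{ClusterOrigin: &ClusterOrigin{UID: "abc"}}))
}
```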

}

// prepareClusterOriginConditions helps set the cluster origin conditions and reports if anything changed in this part of status.
func (h *handler) prepareClusterOriginConditions(backup *v1.Backup) bool {

@alexandreLamarre (Contributor) commented:

I think a lot of the logic in this function could also be simplified and made more readable if we encapsulate most of it in struct methods.

kubeSystemNS string
// TODO: rename to kubeSystemNamespaceUID; nit to improve clarity, it's not the string representation nor the NS resource
kubeSystemNS string
canUseClusterOriginStatus bool

@alexandreLamarre (Contributor) commented:

Stylistically, we should avoid storing additional boolean fields in favor of a method.

I think this field might be useless?

If we opt into using the backup CRD spec for storing cluster metadata, and older backup-restore versions don't set this new nullable field, then we cannot determine where the backup is from.

currentOriginConditionString := condition.Cond(v1.BackupConditionClusterOrigin).GetStatus(backup)
if currentOriginConditionString != "False" {
condition.Cond(v1.BackupConditionClusterOrigin).SetStatusBool(backup, false)
condition.Cond(v1.BackupConditionClusterOrigin).Message(backup, "CRD not updated to include cluster UID yet.")

@alexandreLamarre (Contributor) commented:

When would this realistically happen?

@mallardduck (Member, Author) replied:

Yeah, at least I think so. Granted, it was during dev workflows, but I did run into it myself.

Maybe it's not as big a concern in prod, but I think it could still happen with a chart like BRO if the CRDs aren't also updated, i.e. when installed via the helm CLI directly. Or even in an edge case where a user manually targets a newer image than the chart. However, if we opt not to change the CRD to add this to status, this wouldn't be needed anyway.
